[WIP] [feat] miles lora megatron backend #409
base: main
Conversation
Summary of Changes: Hello @yushengsu-thu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request is a work-in-progress feature that integrates a Megatron backend for LoRA training within the Miles project.
Code Review
This pull request introduces a new reproducibility script for running a LoRA fine-tuning experiment on the Qwen2.5-0.5B model with the GSM8K dataset using the Megatron backend. The script is well-structured, using bash arrays to organize command-line arguments for clarity.
My review focuses on improving the robustness and correctness of the script. I've provided two main suggestions:
- Refining the initial cleanup logic to be more robust and safer, by preferring graceful shutdowns and highlighting the risk of using a broad `pkill` on all Python processes.
- Correcting the environment variable used to disable Python's output buffering from the non-standard `PYTHONBUFFERED` to the correct `PYTHONUNBUFFERED`.

These changes should make the script more reliable and adhere to better practices.
```bash
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
```
The cleanup logic at the beginning of the script is quite aggressive and could be improved for safety and robustness.
- Graceful shutdown: Using `pkill -9` (SIGKILL) immediately prevents processes from cleaning up properly. It's better to first try a graceful shutdown with `pkill` (SIGTERM).
- Broad `pkill`: `pkill -9 python` is very broad and could terminate unrelated Python processes, which is risky outside of a completely isolated container.
- Redundancy: The repeated `pkill` commands suggest the cleanup might be fragile. A single, more robust cleanup sequence is preferable.
Suggested change:
```diff
-pkill -9 sglang
-sleep 3
-ray stop --force
-pkill -9 ray
-pkill -9 python
-sleep 3
-pkill -9 ray
-pkill -9 python
+pkill sglang
+ray stop --force
+sleep 5  # Wait for processes to terminate gracefully
+# Force kill any remaining processes.
+# Note: `pkill -9 python` is broad and can be risky.
+pkill -9 sglang
+pkill -9 ray
+pkill -9 python
```
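To make the reviewer's three points concrete, the graceful-then-forced pattern could be factored into a small helper that also scopes the kill to the current user. This is a minimal sketch, not part of the PR; the `cleanup` helper, the five-second wait, and the `-u "$USER"` scoping are illustrative assumptions:

```bash
#!/usr/bin/env bash
# Try a graceful shutdown first, then escalate to SIGKILL only for
# processes that are still alive, and only those owned by this user.
cleanup() {
    local name=$1
    pkill -u "$USER" "$name" 2>/dev/null || true     # SIGTERM first
    for _ in 1 2 3 4 5; do                           # wait up to ~5s
        pgrep -u "$USER" "$name" >/dev/null || return 0
        sleep 1
    done
    pkill -9 -u "$USER" "$name" 2>/dev/null || true  # escalate to SIGKILL
}

ray stop --force
cleanup sglang
cleanup ray
```

Scoping with `-u "$USER"` limits the blast radius of the final `pkill -9` when the script runs outside an isolated container.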
```bash
set -ex

# will prevent ray from buffering stdout/stderr
export PYTHONBUFFERED=16
```
The environment variable `PYTHONBUFFERED` is not standard. The correct variable to disable output buffering for Python is `PYTHONUNBUFFERED`. Setting it to any non-empty string (conventionally `1`) will have the desired effect of making stdout/stderr unbuffered.
Suggested change:
```diff
-export PYTHONBUFFERED=16
+export PYTHONUNBUFFERED=1
```
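The difference is easy to verify from a shell (an illustrative check, not part of the PR). When stdout is a pipe, Python block-buffers by default, so the first command usually emits all three digits in one batch at the end, while the second streams one digit per second:

```bash
# Buffered: piped output typically appears all at once after ~3s.
python3 -c 'import time
for i in range(3): print(i); time.sleep(1)' | cat

# Unbuffered: each line appears as soon as it is printed.
PYTHONUNBUFFERED=1 python3 -c 'import time
for i in range(3): print(i); time.sleep(1)' | cat
```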
Pull request overview
This PR adds Megatron backend support for LoRA training to the Miles project. The implementation includes disk-based weight synchronization and tensor-based weight update mechanisms for the SGLang rollout engine, representing a work-in-progress feature addition.
Key Changes:
- Added Megatron backend integration for LoRA training
- Implemented disk sync weight functionality
- Added tensor-based LoRA weight update mechanism (pending upstream SGLang PR)
```bash
--rollout-shuffle
--rm-type math
# --num-rollout 100
--num-rollout 10 # onyl train 10 stesp
```
Copilot AI commented on Jan 7, 2026
Spelling error: "onyl" should be "only". The comment should read "# only train 10 steps".
Suggested change:
```diff
---num-rollout 10 # onyl train 10 stesp
+--num-rollout 10 # only train 10 steps
```
```bash
--rollout-shuffle
--rm-type math
# --num-rollout 100
--num-rollout 10 # onyl train 10 stesp
```
Copilot AI commented on Jan 7, 2026
Spelling error: "stesp" should be "steps". The comment should read "# only train 10 steps".
Suggested change:
```diff
---num-rollout 10 # onyl train 10 stesp
+--num-rollout 10 # only train 10 steps
```
```bash
CKPT_ARGS=(
    --hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
    --ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
    # Uncomment to save checkpoints (required for LoRA)
```
Copilot AI commented on Jan 7, 2026
The comment states "Uncomment to save checkpoints (required for LoRA)" but the checkpoint saving arguments on lines 25-26 are already active (not commented out). This creates confusion about whether checkpoints are being saved. Either update the comment to reflect that checkpoints are enabled, or comment out lines 25-26 if they should be optional.
Suggested change:
```diff
-# Uncomment to save checkpoints (required for LoRA)
+# Save checkpoints (required for LoRA). Adjust path/interval as needed.
```
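Applied in context, the block might look like the sketch below. The `--save` path and `--save-interval` value are placeholders standing in for the already-active saving arguments (lines 25-26 of the script) that are truncated out of the hunk above:

```bash
CKPT_ARGS=(
    --hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
    --ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/
    # Save checkpoints (required for LoRA). Adjust path/interval as needed.
    --save /root/ckpts/qwen2.5-0.5b-lora/   # placeholder path
    --save-interval 5                        # placeholder interval
)
```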
```bash
--target-modules "q_proj,k_proj,v_proj,o_proj"
# --target-modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj"
# --lora-sync-from-tensor # Use tensor-based sync (more efficient)
# Uncomment to share base model between actor and ref (saves memory)
```
Copilot AI commented on Jan 7, 2026
The comment states "Uncomment to share base model between actor and ref (saves memory)" but the `--share-ref-base-model` argument on line 41 is already active (not commented out). This creates confusion. Either update the comment to reflect that sharing is enabled, or comment out line 41 if it should be optional.
Suggested change:
```diff
-# Uncomment to share base model between actor and ref (saves memory)
+# Share base model between actor and ref (saves memory)
```
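For illustration, the surrounding block could then read as below. This is a sketch: the `LORA_ARGS` array name and the placement of `--share-ref-base-model` are assumptions, since line 41 of the script is not shown in the hunk:

```bash
LORA_ARGS=(
    --target-modules "q_proj,k_proj,v_proj,o_proj"
    # --target-modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj"
    # --lora-sync-from-tensor # Use tensor-based sync (more efficient)
    # Share base model between actor and ref (saves memory)
    --share-ref-base-model
)
```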
…cient Lora should be supported
Description
Support LoRA training with the Megatron backend (via megatron-bridge) and fix bugs (update LoRA weights to the SGLang rollout engine via tensor, which is faster than the previous disk sync approach).
Waiting for this sglang PR to be merged: Update LoRA Weights via Tensor sgl-project/sglang#16226
- `/python/sglang/srt/models/qwen2.py`, line 611
- The `--offload-rollout-level kv_cache weight` part is not right.

Prerequisites
- Docker image: `radixark/miles:latest` (Digest: 6e467519505d)
- Git clone this megatron-bridge branch: https://github.com/yushengsu-thu/Megatron-Bridge/tree/merged-megatron-0.16.0rc0
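Concretely, the setup could look like the following commands, derived from the two prerequisites above rather than taken from the PR:

```bash
# Pull the prebuilt image and fetch the patched Megatron-Bridge branch.
docker pull radixark/miles:latest
git clone -b merged-megatron-0.16.0rc0 \
    https://github.com/yushengsu-thu/Megatron-Bridge.git
```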
Testing
Related Issues, PRs:
Lora FSDP backend PR: #377
SGLang sync from tensor: sgl-project/sglang#16226
Code Style Compliance
- No `.item()`, `.cpu()`, or `.tolist()` calls in inference paths